The first technique we’re going to try out is text mining. This is the name for a family of tools used to analyse text at scale: rather than reading every word, we’ll use quantitative methods to search for patterns. There are three options to choose from:
The instructions for options 1 and 2 are contained within this page, but to use the interactive notebook, you need to load an interactive environment called MyBinder.
First, open the following in a new window: https://mybinder.org/v2/gh/yann-ryan/dh_intro_gates/main?urlpath=rstudio
This will start loading a new Binder instance, which is an interactive coding environment. It might take a minute or two, so it might be worth going through some of the tutorials below while you’re waiting.
Once it has finished, you should see this screen:
This is called RStudio: an application designed for writing and running code. We’re going to open a pre-made ‘notebook’. The bottom-right pane contains a list of files. Look for one called ‘network_analysis_r.Rmd’ and click on it. It will open the notebook in the top-left pane.
Voyant (voyant-tools.org) is a collection of text mining tools with a user interface. It’s a great way into text mining.
Open a browser and go to https://voyant-tools.org/
### Load a pre-existing text
The front page of Voyant allows you to add texts, either one or multiple. It also has a limited selection of texts pre-loaded, so if you would prefer to load one of these and just spend some time playing around with the interface, that is a good place to start. Click ‘Open’ and choose a corpus containing the entire works of either William Shakespeare or Jane Austen. If you do this, you can skip straight to the ‘Voyant Interface’ section below, if you like.
There are two ways to load your own text(s): either by entering a URL to a document or by uploading a file from your own computer. We’ll use plain text files (the type created by Notepad or TextEdit), but you can also upload HTML, XML, and various other formats (see the help file for more details).
First we need some books in plain text format. Project Gutenberg is a site containing a large database of freely-available ebooks, in a variety of formats. Let’s make a corpus containing the four Sherlock Holmes novels.
First, open a new tab in your browser, and find the book you’d like to add on the site, using the search or browse function. The book page will allow you to download the eBook in a number of formats. Once you’ve found a book of interest, right-click on the ‘Plain Text UTF-8’ link, and click ‘Copy link address’ (in Chrome):
Return to the Voyant-tools tab, and paste the link into the input box. Switch back to Project Gutenberg, find the other three novels, and repeat the process, pasting each link on its own line:
Click ‘Reveal’ to load the texts in Voyant.
If you’re not interested in literature, or if you have a set of documents from your own research you’d like to analyse, you can add text from any URL, or from your computer. There are many sources of text files available, although in many cases, you’ll have to first download and unzip a set of files before uploading them.
Some sources to try are listed below. Many of these are large files or require some additional steps, so they might be best tried offline after the workshop:
Follow the instructions here: https://glam-workbench.net/trove-harvester/ to bulk download text files from Australian historical newspapers.
The Enron corpus (https://www.cs.cmu.edu/~enron/) if you are interested in more recent text for analysis.
Whether you loaded a pre-existing text, or added your own, you should now be presented with the following screen:
There’s a lot going on here at first, so I’ll break it down. The screen is divided into five separate panes: three at the top and two underneath, displaying a range of standard text mining tools. Voyant has many more tools available, and you can swap out the default ones for others. If you hover over the top-right of a pane, you’ll see three new options. If you click on the windows icon (second from the left), you can select a new tool to replace the current one. You can do this for any of the windows.
The default tools are, clockwise from the top-left:
Some of the windows have additional pages. Click on ‘terms’ in the word cloud (top-left) and instead of a word cloud you’ll get a count of the occurrences of the top terms.
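To give a sense of what a term count like this involves, here is a minimal sketch in Python (not the R used in the workshop notebook). It is not Voyant's own implementation: the `tokenize` and `top_terms` functions, the tiny stopword list, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real tools use much longer lists).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "is", "was", "i", "that"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_terms(text, n=5):
    """Return the n most frequent non-stopword terms with their counts."""
    counts = Counter(token for token in tokenize(text) if token not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical sample text, standing in for a full novel.
sample = "The hound ran. The hound howled, and Holmes heard the hound."
print(top_terms(sample, n=3))
```

Removing very common words such as ‘the’ before counting is a design choice: without it, the top of the list is dominated by words that say little about the content of a text.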
The windows are connected to each other: for example, if you click on a word in the word cloud in the top-left window, you’ll see the frequency of that word in the trends pane on the right.
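The kind of figure a trends chart plots can be approximated as the relative frequency of a word in each document: the number of times it occurs divided by the total number of words. The Python sketch below illustrates that idea under stated assumptions; the `relative_frequency` function and the two-document corpus are hypothetical, and Voyant's own calculation may differ in its details.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequency(word, documents):
    """Map each document title to (occurrences of word) / (total tokens)."""
    result = {}
    for title, text in documents.items():
        tokens = tokenize(text)
        # Guard against empty documents to avoid dividing by zero.
        result[title] = tokens.count(word.lower()) / len(tokens) if tokens else 0.0
    return result


# Hypothetical two-document corpus, standing in for full novels.
corpus = {
    "doc_a": "Holmes smiled. Watson wrote. Holmes left.",
    "doc_b": "Watson waited alone for the morning train.",
}
freqs = relative_frequency("holmes", corpus)
for title, value in freqs.items():
    print(title, round(value, 3))
```

Dividing by the document length means that a long and a short document can be compared on the same scale.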
A typical text mining task is to compare the writing style of a set of documents using what’s known as the ‘type-token ratio’. It’s the total number of unique words (known as types) divided by the total number of words (tokens). The ratio can be interpreted as the ‘richness’ of the vocabulary in a particular text.
To see this, click on the documents tab in the summary window (bottom-left by default).
We can see that there is some difference between the Sherlock Holmes novels, though we need to be careful with the interpretation. Longer novels will naturally have a smaller ratio, because an author keeps reusing words they have already used and adds new ones more slowly as the text gets longer. The two final novels are a very similar length, so they are the most easily comparable, and they have very similar ratios.
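The type-token ratio and its sensitivity to length can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Voyant computes its figures: the function names and sample texts are hypothetical, and truncating every document to the length of the shortest one is just one simple way of making ratios comparable.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(tokens):
    """Unique words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ratios_at_common_length(documents):
    """Compare documents on their first n tokens, where n is the shortest length."""
    token_lists = {title: tokenize(text) for title, text in documents.items()}
    if not token_lists:
        return {}
    n = min(len(tokens) for tokens in token_lists.values())
    return {title: type_token_ratio(tokens[:n]) for title, tokens in token_lists.items()}


# Hypothetical texts: the longer one simply repeats the vocabulary of the shorter one.
docs = {
    "short": "the cat sat on the mat",
    "long": "the cat sat on the mat and the cat sat on the mat again",
}

# Raw ratios: the longer text scores lower because its words are reused.
for title, text in docs.items():
    print(title, round(type_token_ratio(tokenize(text)), 2))

# Ratios over the same number of tokens are directly comparable.
print(ratios_at_common_length(docs))
```

In this toy example the raw ratio of the longer text is lower, but once both texts are cut to the same length the ratios are identical, which suggests the difference came from length rather than vocabulary.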
Spend some more time trying out Voyant tools. Swap out the default windows for some other ones, and note any interesting observations.
If you’d like to get a flavour of how a programming language can be used to analyse text, you have a couple of options. I’ve put together a very short demo of R in an interactive document called a ‘notebook’, here.
This notebook loads in an interactive environment called Binder. Once the link above has loaded, you’ll see an interactive document with instructions.
A third option is to use a service called Constellate, run by the journal database JSTOR. This allows you to build and analyse a corpus of JSTOR articles, using search terms. New datasets take some time to initialise, so for now it’s best to use an existing one.
First, go to https://constellate.org/ and click on ‘dashboard’ on the top-right. You have the option of building a new dataset or selecting a featured dataset. For now, try out one of the featured datasets. Click ‘analyze’, and you’ll get a pop-up window containing links to a series of notebooks (interactive documents containing code and text). If you have never used Jupyter notebooks before, start with the first tutorial to learn how to use them; otherwise, feel free to check out other ones more specific to text mining.
After this workshop, I can recommend playing around with the corpus builder, which will construct a dataset suitable for text mining from a set of parameters and keywords.